Python dataframes with pandas and polars

Andreas Beger and Isaac Chung
PyData Tallinn x Python CodeClub
27 November 2024

Bios

Andreas Beger

  • 🏢 Data Scientist, Consult.
  • 🏃‍♂️🐌 Slow marathoner
  • 📍 🇩🇪/🇭🇷 → 🇺🇸 → 🇪🇪
  • 🎓 PhD Political Science

Isaac Chung

  • 🏢 Staff Data Scientist, Wrike
  • 🏊‍♂️🚴🏃‍♂️ Fast triathloner
  • 📍 🇭🇰 → 🇨🇦 → 🇪🇪
  • 🎓 MS Machine Learning

🐍 We are also the PyData Tallinn co-organizers.

Getting set up

Instructions for how to follow along in notebooks…GitHub codespaces?

What are dataframes?

Definition

Tables, 2D arrays, etc.

Why?

Show a list of Python dictionaries

vs.

a pandas DataFrame
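
A minimal sketch of that comparison. The records and column names (city, year, visitors) are invented for illustration and do not come from the talk's data. It computes the same per-city total twice: once with a loop over a list of dictionaries, once with a pandas DataFrame.

```python
import pandas as pd

# Hypothetical records, invented purely for illustration
records = [
    {"city": "Tallinn", "year": 2023, "visitors": 120},
    {"city": "Tartu", "year": 2023, "visitors": 45},
    {"city": "Tallinn", "year": 2024, "visitors": 150},
]

# Plain Python: total visitors per city needs an explicit loop
totals = {}
for row in records:
    totals[row["city"]] = totals.get(row["city"], 0) + row["visitors"]

# pandas: the same records become a table with named, typed columns,
# and the aggregation is a single method chain
df = pd.DataFrame(records)
totals_df = df.groupby("city")["visitors"].sum()

print(totals)
print(totals_df)
```

Both approaches give the same totals. The dataframe version states what to compute rather than how to loop over the rows.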

Common dataframe operations

  • 📖 ✍️ read and write
  • 🔬 inspect
  • 🔍 filter rows
  • 🛒 select columns
  • 🥪 mutate, add columns
  • 🤝 join other dataframes
  • 👨‍👩‍👧‍👦 group and aggregate
  • 🧱 reshape wide, long

pandas

History

Wes McKinney

Originally built on top of NumPy

pandas 2 adds support for an Arrow backend

Getting started

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "quarter": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "x": np.random.randn(12),
    "date": pd.date_range("2024-01-01", periods=12, freq="MS")
})

df.head()
   quarter         x       date
0        1  1.009997 2024-01-01
1        1  1.451022 2024-02-01
2        1 -0.216127 2024-03-01
3        2 -0.761972 2024-04-01
4        2 -0.409263 2024-05-01
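
One possible walk through the common dataframe operations listed earlier, using the same small dataframe. This is a sketch rather than the talk's actual demo: the random seed, the in-memory CSV buffer, the labels lookup table, and the derived column names (x_abs, month, x_mean, n) are all assumptions added for illustration.

```python
import io

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded so the values are reproducible
df = pd.DataFrame({
    "quarter": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "x": rng.standard_normal(12),
    "date": pd.date_range("2024-01-01", periods=12, freq="MS"),
})

# Read and write: round-trip through CSV (in memory here; a file path also works)
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df_read = pd.read_csv(buffer, parse_dates=["date"])

# Inspect: column types and summary statistics
df.info()
print(df.describe())

# Filter rows with a boolean mask
second_half = df[df["quarter"] > 2]

# Select columns with a list of names
subset = df[["quarter", "x"]]

# Mutate: add columns derived from existing ones
df = df.assign(x_abs=df["x"].abs(), month=df["date"].dt.month)

# Join with another dataframe (hypothetical lookup table)
labels = pd.DataFrame({"quarter": [1, 2, 3, 4], "label": ["Q1", "Q2", "Q3", "Q4"]})
joined = df.merge(labels, on="quarter", how="left")

# Group and aggregate with named aggregations
by_quarter = df.groupby("quarter", as_index=False).agg(
    x_mean=("x", "mean"),
    n=("x", "size"),
)

# Reshape: wide to long, then back to wide
long = df.melt(
    id_vars=["date", "quarter"],
    value_vars=["x", "x_abs"],
    var_name="variable",
    value_name="value",
)
wide = long.pivot(index="date", columns="variable", values="value")
```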

Import accidents data

pandas is great


2017, Wes McKinney (creator of pandas):

10 Things I Hate About Pandas

  • Inefficient memory management: needs RAM of 5-10x the data size
  • Eager evaluation → limited query planning
  • No multi-core support

Arrow

2016: Apache Arrow, a common standard for in-memory data representation

polars

History

2020 Ritchie Vink

Uses Arrow as its internal representation

new slides

  • Out with indices
  • Out with .loc, .iloc
  • Out with [ ] indexing
  • In with lazy evaluation
  • In with expressions

Easy to convert between the two

import polars as pl

df_pl = pl.from_pandas(df)   # pandas → polars
df_pd = df_pl.to_pandas()    # polars → pandas

example

go through same example again, but with polars

The big picture

Andy is a polars stan

Comparison

pandas

  • ✅ Very widely used and supported
  • ✅ Stable
  • ❓ More imperative, traditional API
  • ❌ Inconsistent API, multiple ways of doing the same thing
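
To illustrate the "multiple ways of doing the same thing" point, here is a small sketch with made-up data. It shows several equivalent spellings for selecting one column and for filtering rows in pandas.

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({"quarter": [1, 1, 2, 2], "x": [0.5, -1.0, 2.0, 1.0]})

# Four ways to select the same column
a = df["x"]
b = df.x
c = df.loc[:, "x"]
d = df.iloc[:, 1]

# Three ways to keep rows where x is positive
f1 = df[df["x"] > 0]
f2 = df.query("x > 0")
f3 = df.loc[df["x"] > 0]

print(a.equals(b), a.equals(c), a.equals(d))
print(f1.equals(f2), f1.equals(f3))
```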

polars

  • ✅ More consistent, functional-style API
  • ✅ Faster, smaller memory footprint
  • ✅ Works with larger-than-memory datasets out of the box
  • ❌ API still changing

Other frameworks

  • Narwhals
  • DuckDB

Thank you!

QR code for feedback?